Skip to content

feat(adf): subscription-based provider routing with persona and skill chains#705

Open
AlexMikhalev wants to merge 33 commits intomainfrom
task/28-provider-model-fallback-schema
Open

feat(adf): subscription-based provider routing with persona and skill chains#705
AlexMikhalev wants to merge 33 commits intomainfrom
task/28-provider-model-fallback-schema

Conversation

@AlexMikhalev
Copy link
Contributor

Summary

Stats

  • 21 files changed, +4,645 lines
  • 103 tests passing (92 unit + 11 integration)

ADRs

  • ADR-002: Subscription-only model providers (ban opencode/Zen)
  • ADR-003: Four-tier model routing (Quick/Deep/Implementation/Oracle)
  • ADR-004: Terraphim persona identity layer (8 named personas)
  • ADR-005: kimi-for-coding as implementation tier

Test plan

  • cargo test -p terraphim_spawner -- 92 unit tests pass
  • Integration tests -- 11 tests pass (provider routing, circuit breaker, subscription guard, NDJSON, skills, personas)
  • cargo clippy -p terraphim_spawner -- clean
  • Rebuild adf binary on bigbox and restart service
  • Verify agents start with new provider routing

AlexMikhalev and others added 30 commits March 17, 2026 16:24
Security hardening for terraphim_rlm crate:

1. Created validation.rs module with:
   - validate_snapshot_name(): Prevents path traversal attacks
   - validate_code_input(): Enforces MAX_CODE_SIZE (1MB) limit
   - validate_session_id(): Validates UUID format
   - validate_recursion_depth(): Prevents stack overflow
   - Security constants: MAX_CODE_SIZE, MAX_INPUT_SIZE, MAX_RECURSION_DEPTH

2. Fixed race condition in firecracker.rs:
   - Changed snapshot counter from read-then-write to atomic write lock
   - Added validate_snapshot_name() call before snapshot creation
   - Prevents TOCTOU vulnerability where concurrent snapshots could exceed limit

3. Enhanced mcp_tools.rs:
   - Added MAX_CODE_SIZE validation for rlm_code tool
   - Added MAX_CODE_SIZE validation for rlm_bash tool
   - Returns proper MCP error format for validation failures

Refs #426
Refs: PR #426
Features:
- fcctl-core to terraphim_firecracker adapter
- Sub-500ms VM allocation (267ms measured)
- ULID-based VM ID enforcement
- Full trait implementation with error preservation
- 119 tests passing

Validation: All acceptance criteria met
…erTier enum

Add provider, fallback_provider, fallback_model, and provider_tier fields
to AgentDefinition for subscription-based model routing (ADR-002, ADR-003).
Add ProviderTier enum (Quick/Deep/Implementation/Oracle) with per-tier
timeout values. Add opencode CLI support in spawner arg inference.

All new fields are Optional with serde(default) for backward compatibility.

Fixes #28
Refs #29

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add provider allowlist and banned_providers fields to OrchestratorConfig.
Default banned list includes "opencode" (Zen pay-per-use proxy, ADR-002).
Add validate_provider() method to spawner AgentConfig that rejects any
model string starting with a banned prefix, while correctly allowing
opencode-go/ (subscription) and kimi-for-coding/ (subscription).

Fixes #31

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add persona_name, persona_symbol, persona_vibe, meta_cortex_connections,
and skill_chain fields to AgentDefinition for the four-layer identity
stack (Persona/Role/SFIA/Skills). All fields Optional/default for
backward compatibility. See ADR-004.

Fixes #32

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add SpawnRequest struct and spawn_with_fallback() method that:
- Validates banned providers before spawn (ADR-002)
- Uses ProviderTier timeout values (Quick=30s, Deep=60s, Impl=120s, Oracle=300s)
- Retries with fallback provider/model on primary failure
- Integrates per-provider circuit breakers (3 failures = open 5 min)
- Returns SpawnerError on both primary and fallback failure

SpawnRequest avoids circular dependency with terraphim_orchestrator
by mirroring needed AgentDefinition fields in the spawner crate.

Fixes #30

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add persona_name, persona_symbol, persona_vibe, and meta_cortex_connections
to SpawnRequest. The build_persona_prefix() function generates a markdown
identity block that is prepended to the agent task prompt when persona
fields are configured. Agents without persona config are unaffected.

Fixes #33

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add SkillChainRegistry with 31 terraphim-skills and 16 zestic-skills.
Provides validate_chain() to verify skill chains and validate_skill_chains()
on OrchestratorConfig to validate all agents. Backward compatible -- agents
with empty skill_chain pass validation.

Fixes #34

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add OpenCodeEvent struct with parsing for opencode run --format json
output events (step_start, text, tool_use, step_finish, result).
Includes text_content(), total_tokens(), parse_line(), and parse_lines()
helper methods for structured output consumption.

Fixes #37

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add SkillResolver in skill_resolver.rs that maps skill chain names to actual
skill file paths from the terraphim-skills repository.

Features:
- SkillResolver with registry of terraphim skills (security-audit, code-review,
  session-search, local-knowledge, git-safety-guard, devops, disciplined-research,
  architecture, disciplined-design, requirements-traceability, testing,
  acceptance-testing, documentation, md-book, implementation, rust-development,
  visual-testing, quality-gate)
- resolve_skill_chain() method that takes Vec<skill_name> and returns
  resolved skill metadata (name, description, applicable_to, paths, source)
- Validation of skill chains with proper error handling
- Comprehensive test coverage including valid resolution, missing skill errors,
  and empty chain handling
- Exported from crate as SkillResolver, SkillSource, ResolvedSkill,
  SkillResolutionError

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…h Refs #36

Extend SkillResolver to support zestic-engineering-skills from zestic-ai/6d-prompts
repository alongside existing terraphim-skills.

Features:
- Add SkillSource enum (Terraphim, Zestic) to distinguish skill origins
- Initialize zestic skills registry with 12 skills: quality-oversight,
  responsible-ai, insight-synthesis, perspective-investigation, product-vision,
  wardley-mapping, business-scenario-design, rust-mastery, cross-platform,
  frontend, via-negativa-analysis, strategy-execution
- Each resolved skill includes its source metadata
- SkillChainRegistry validation accepts skills from both sources
- Comprehensive tests for mixed skill chains (terraphim + zestic together)

Tests added:
- test_resolver_has_zestic_skills
- test_resolve_zestic_skill
- test_resolve_mixed_skill_chain
- test_validate_mixed_skill_chain_valid
- test_validate_only_zestic_skills
- test_mixed_chain_with_invalid_skills
- test_zestic_skill_source_in_resolved
- test_all_skill_names_includes_zestic
- test_zestic_skill_structure
- test_resolver_custom_paths_for_zestic

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…routing Refs #39

Replace all legacy codex CLI references with opencode + subscription providers.
Add persona, skill_chain, provider/model/fallback fields to all agents.
New agents: compliance-watchdog, drift-detector, spec-validator, test-guardian,
documentation-generator, implementation-swarm, compound-review.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
SSH non-interactive shells (systemd) do not include ~/.bun/bin or
~/.local/bin in PATH. Use full paths for opencode and claude binaries.
Also fix TOML escape in --since quote.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Creates crates/terraphim_judge_evaluator with:
- SimpleAgent struct wrapping terraphim_router for KG lookups
- KgMatch struct for representing term matches
- lookup_terms() method using Aho-Corasick automata
- enrich_prompt() for appending KG context to judge prompts
- Comprehensive tests for all functionality

Also adds automation/judge/model-mapping.json configuration file.
Implements JudgeModelRouter for tier-based LLM model selection:
- TierConfig struct for provider+model pairs
- ModelMappingConfig for complete configuration
- JudgeModelRouter with methods:
  * from_config(path) - load from JSON file
  * resolve_tier(tier) - get provider/model for a tier
  * resolve_profile(profile) - get tier sequence for a profile
- Supports tiers: quick, deep, tiebreaker, oracle
- Supports profiles: default, thorough, critical, exhaustive
- Comprehensive tests for all functionality
- Updates automation/judge/model-mapping.json with profile definitions
Implements JudgeAgent as a supervised agent combining:
- SimpleAgent for Knowledge Graph context enrichment
- JudgeModelRouter for tier-based model selection
- SupervisedAgent trait from terraphim_agent_supervisor

JudgeVerdict struct with:
- verdict: final evaluation result (PASS/FAIL/NEEDS_REVIEW)
- scores: BTreeMap of detailed category scores
- judge_tier: which tier produced the verdict
- judge_cli: command used for evaluation
- latency_ms: evaluation duration

Evaluation pipeline:
1. Load file content
2. Enrich with KG context (optional)
3. Select model tier based on profile
4. Parse verdict from CLI output (JSON or text format)

Includes comprehensive tests for:
- Verdict parsing and manipulation
- Supervised agent lifecycle
- Profile-based tier selection
- System message handling
- Add stagger_delay_ms config field (default: 5000ms)
- Insert stagger delay between Safety agent spawns in run()
- Add random jitter (0 to stagger_delay_ms) for Core agent cron spawns
- Add tests for stagger delay configuration

Refs #16
- Add ReviewRequest struct with from_agent, to_agent, artifact_path, review_type
- Add ReviewPair config for defining (producer, reviewer) pairs
- Add review_queue: Vec<ReviewRequest> to orchestrator state
- Add submit_review_request(), review_queue(), process_review_queue() methods
- Add check_review_trigger() to automatically queue reviews on agent completion
- Add tests for review queue operations and config loading

Refs #17
- Add drift_detection module with DriftDetector and DriftReport
- Load strategic goals from plans/ directory markdown files
- Check drift every N ticks (configurable via drift_detection.check_interval_ticks)
- Calculate drift score by comparing agent outputs against strategic goals
- DriftReport includes agent, drift_score, and explanation
- Log warnings when drift_score exceeds threshold (default: 0.6)
- Add DriftDetectionConfig to OrchestratorConfig
- Add tests for drift detection functionality

Refs #18
- Add session_rotation module with SessionRotationManager and AgentSession
- Add SessionRotationConfig with max_sessions_before_rotation (default: 10)
- Track completions_since_rotation and completed_sessions per agent
- on_agent_completion() records completion and triggers rotation at threshold
- Rotation creates new session ID and clears accumulated context
- Add comprehensive tests for session rotation

Refs #19
- Add convergence_detector module with ConvergenceDetector and ConvergenceSignal
- Add ConvergenceConfig with threshold (default: 0.95) and consecutive_threshold (default: 3)
- Calculate output similarity using Jaccard index on word sets
- Detect convergence after N consecutive similar outputs
- ConvergenceSignal includes agent, similarity, and consecutive_count
- Reset on divergence to handle changing outputs
- Add comprehensive tests for convergence detection

Refs #20
…iven mode

Issue #8: Extend orchestrator config for issue-driven mode
- Add WorkflowConfig with mode (time_only/issue_only/dual), poll_interval_secs, max_concurrent_tasks
- Add TrackerConfig with tracker_type (gitea/linear), url, token_env_var, owner, repo
- Add ConcurrencyConfig with max_parallel_agents, queue_depth, starvation_timeout_secs
- All optional in orchestrator.toml with backward compatible defaults (time_only mode is default)
- Tests: parse config with and without workflow section, defaults applied

Issue #9: Unified dispatcher
- Create crates/terraphim_orchestrator/src/dispatcher.rs
- DispatchTask enum: TimeTask(agent_name, schedule), IssueTask(agent_name, issue_id, priority)
- DispatchQueue: priority queue backed by BinaryHeap with round-robin fairness
- ConcurrencyController: semaphore-based with starvation timeout
- Methods: submit(task), next() -> Option<DispatchTask>, active_count(), is_full()
- Fairness: alternates between time and issue tasks at equal priority
- Tests: submit/dequeue, priority ordering, concurrency limits, fairness

Issue #10: Issue mode controller
- Create crates/terraphim_orchestrator/src/issue_mode.rs
- IssueMode struct using terraphim_tracker::GiteaTracker to poll for issues
- Poll loop every poll_interval_secs fetching ready issues via PageRank sorting
- Filter out blocked issues and already-running tasks
- Map issues to agents based on labels ([ADF] -> implementation-swarm) or title patterns
- Submit IssueTask to DispatchQueue
- Tests: poll cycle, issue-to-agent mapping, priority calculation, blocked issue filtering

Issue #11: Time mode refactor
- Refactor crates/terraphim_orchestrator/src/scheduler.rs
- Extract TimeMode struct wrapping existing cron scheduler
- TimeMode submits TimeTask to DispatchQueue instead of spawning directly (when configured)
- Maintain backward compatibility: legacy mode spawns directly if no WorkflowConfig
- Tests: TimeMode submits to queue, legacy mode still works

Refs #8 #9 #10 #11
- Add ModeCoordinator that manages TimeMode and IssueMode
- Implement unified shutdown: signal modes, drain queue, wait for active tasks
- Extend spawner integration to dispatch tasks from queue to agents
- Add stall detection: log warning when queue exceeds threshold
- Add comprehensive tests for dual mode, shutdown coordination, and stall detection

Refs #12
Create comprehensive E2E test suite in tests/e2e_tests.rs:
- test_dual_mode_operation: verify both time and issue tasks processed
- test_time_mode_only: legacy config compatibility
- test_issue_mode_only: issue-only config verification
- test_fairness_under_load: no starvation between task types
- test_graceful_shutdown: clean termination with queue draining
- test_stall_detection: warning logged when queue exceeds threshold
- Additional tests for concurrency limits, prioritization, and backward compatibility

Uses mock tracker and avoids real API calls.

Refs #13
AlexMikhalev and others added 3 commits March 20, 2026 20:55
- Create src/compat.rs: Symphony compatibility layer with type aliases, adapters
- Add migration helpers for enabling dual mode from legacy configs
- Create MIGRATION.md with comprehensive migration guide
- Update CLAUDE.md with dual mode architecture description
- Include backward compatibility notes and configuration examples
- Export compat module from lib.rs

Refs #14
- Run cargo test --workspace (orchestrator tests pass)
- Fix all warnings in orchestrator crate
- Create CHANGELOG.md entry for v1.9.0 release
- Verify backward compatibility with test_backward_compatibility
- Build release binary: cargo build --release -p terraphim_orchestrator
- All 142 tests pass in orchestrator crate

Release includes:
- Dual mode orchestrator with TimeMode and IssueMode
- Unified dispatch queue with fairness
- Stall detection and graceful shutdown
- Symphony compatibility layer
- Comprehensive E2E test suite
- Migration documentation

Refs #15
- Fix needless borrows for generic args in workspace crate
- Apply consistent code formatting across modified files
Copy link
Contributor Author

@AlexMikhalev AlexMikhalev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #705 Review: Subscription-Based Provider Routing

Reviewer: Automated code review
Verdict: Changes requested -- several issues need addressing before merge


File Hygiene (HIGH)

117 of 205 changed files are NOT relevant to this PR:

  • 101 files: terraphim_server/dist/assets/bulmaswatch/ -- CSS theme source files that should not be in version control (or at minimum should not be in this feature PR)
  • 5 files: terraphim_server/dist/assets/*.js -- compiled frontend bundles
  • 8 files: reports/upstream-sync-*.md, reports/coordination-*.md, reports/security-*.md -- ephemeral agent-generated reports
  • 3 files: Root-level upstream-sync-*.md, coordination-*.md -- duplicates of the above
  • 1 file: .deployment-marker -- deployment state file
  • 1 file: ~/.config/terraphim/sessions/telegram_287867183.jsonl -- personal session data leaked into PR

Action: Remove all unrelated files from this PR. The dist/ directory should be in .gitignore. The reports belong in a separate artifact store or CI output.


Architecture (MEDIUM)

Good

  • Clean three-crate split: terraphim_spawner (process management), terraphim_orchestrator (scheduling), terraphim_tracker (issue tracking)
  • Circuit breaker design (Closed -> Open after 3 failures -> Half-Open after 5min) is sound
  • Four-tier provider routing (Quick/Deep/Implementation/Oracle) with fallback chains is well-structured
  • ADRs are clear and well-motivated

Issues

  1. Duplicated ProviderTier enum across terraphim_spawner and terraphim_orchestrator. Extract to terraphim_types to prevent drift.

  2. SkillResolver hardcodes all 30 skills (18 terraphim + 12 zestic) as string literals in Rust code. Adding a skill requires recompile. The filesystem paths are configured but not used for discovery.

  3. Circuit breaker state is in-memory only -- resets on process restart. Acceptable for now but document the trade-off.


Security (CRITICAL + MEDIUM)

CRITICAL: PermittedProviderFilter uses substring matching

The filter uses terraphim_automata::find_matches() which performs Aho-Corasick substring matching. A provider string like "not-opencode-go" would match "opencode-go" in the thesaurus and pass validation. A malicious provider name containing a permitted substring would bypass the guard.

Fix: Switch to exact match via HashSet::contains on provider names, or validate that the match spans the entire input string. Automata-based substring matching is the wrong tool for an allowlist check.

MEDIUM: Path traversal in sfia_metaprompt

At lib.rs:744, sfia_metaprompt is joined to working_dir and read from disk without sanitization:

let full_path = working_dir.join(metaprompt_path);
std::fs::read_to_string(&full_path)

If config ever comes from an untrusted source, ../../etc/passwd would read arbitrary files. Add path canonicalization and verify the resolved path stays within working_dir.


Code Quality (HIGH + MEDIUM)

HIGH: Mutex::unwrap() in production code (health.rs)

Lines 335, 341, 348, 354, 372 all use .lock().unwrap(). A poisoned mutex will panic the health check loop. Use .lock().unwrap_or_else(|e| e.into_inner()) or switch to parking_lot::Mutex (no poisoning).

MEDIUM: Dead mpsc channel in OutputCapture

At output.rs:47:

let (event_sender, _event_receiver) = mpsc::channel(100);

The receiver is immediately dropped. All subsequent event_sender.send() calls silently fail. Either remove the channel entirely (broadcast is the real delivery) or wire up the receiver.

MEDIUM: Thesaurus cloned on every is_permitted() call

At lib.rs:136, thesaurus.clone() is called for every spawn request validation. If this PR switches to HashSet (per the security fix), this becomes moot.


Testing (MEDIUM)

Good

  • 92 unit + 11 integration tests is solid coverage
  • Tests for persona injection (full/partial/none), skill chains, NDJSON parsing, subscription guard
  • Using echo/cat as mock agent binaries is a pragmatic approach

Missing

  1. No half-open/recovery test for CircuitBreaker -- only Closed->Open is tested. The cooldown timer expiry and success-in-half-open paths are untested.

  2. claude_session_tests.rs re-implements types locally instead of importing from the crate. If production parsing diverges from these local structs, tests will pass while production breaks.

  3. Weak assertion in test_parse_malformed_json_returns_none:

result.is_none() || input.is_empty()

The || input.is_empty() clause makes the assertion trivially true for empty strings.


Config (MEDIUM)

orchestrator.toml has hardcoded absolute paths

Multiple references to /home/alex/terraphim-ai, /home/alex/.local/bin/claude, /home/alex/.bun/bin/opencode, /opt/ai-dark-factory/. These only work on one machine.

Fix: Use environment variable substitution ($HOME, $TERRAPHIM_DIR) or make paths relative to working_dir.


Dependencies (LOW)

  • New deps (nix, regex, chrono, uuid, cron) are all justified
  • terraphim_automata is heavy for what is essentially a set membership check in PermittedProviderFilter -- if switching to HashSet, this dep can be removed from spawner

Summary of Required Actions

Priority Issue Action
CRITICAL Substring matching in provider filter Switch to exact match
HIGH 117 unrelated files in PR Remove from branch, update .gitignore
HIGH Mutex::unwrap() in health.rs Use poison-safe locking
MEDIUM Dead mpsc channel in OutputCapture Remove or wire up
MEDIUM Duplicated ProviderTier enum Extract to terraphim_types
MEDIUM Path traversal in sfia_metaprompt Canonicalize and validate
MEDIUM No circuit breaker half-open test Add recovery test
MEDIUM Hardcoded paths in orchestrator.toml Use env vars or relative paths
LOW Hardcoded skill registry Load from filesystem

Good foundational work on the provider routing and persona system. The architecture is sound and the ADRs are well-written. Addressing the security issue with substring matching and cleaning up the file hygiene would make this ready for merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Create article index works differently when called from Axum server and from ripgrep middleware

1 participant